{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  10.1变分推断"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于普通函数$f(x)$,可以认为$f$是一个关于$x$的实数算子，将实数$x$映射到实数$f(x)$，类比该模式，存在算子$F$,是关于$f(x)$的函数算子，将$f(x)$映射成实数$F(f(x))$。例如信息熵$H(p(x))$，将概率密度$p(x)$映射为一个具体的值:\n",
    "$$H[p]=-\\int p(x)lnp(x)dx$$\n",
    "\n",
    "由此引入泛函导数的概念，它表达了输⼊函数产⽣⽆\n",
    "穷⼩的改变时，泛函的值的变化情况"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设一个纯粹的贝叶斯模型，其中每个参数都有⼀个先验概率分布。这个模型也可以有潜在变量以及参数，我们会把所有潜在变量和参数组成的集合记作Z，把所有观测变量的集合记作X。数据独立同分布,$X=\\{x_1,...,x_N\\},Z=\\{z_1,...,z_N\\}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "概率模型确定了联合概率分布$p(X,Z)$，我们的⽬标是找到对后验概率分布$p(Z | X)$以及模型证\n",
    "据$p(X)$的近似。与我们关于EM的讨论相同，我们可以将对数边缘概率分解，即\n",
    "$$lnp(x)=\\mathcal L(q)+KL(q\\lVert p)$$\n",
    "其中:\n",
    "$$\\mathcal L(q)=\\int q(Z)ln\\{\\frac{p(X,Z)}{q(Z)}\\}dZ$$\n",
    "$$KL(q\\lVert p)=- \\int q(Z)ln\\{\\frac{p(Z|X\n",
    ")}{q(Z)}\\}dZ$$\n",
    "\n",
    "$lnp(X)$为常数，所以最大化下界$\\mathcal L(q)$等价于最小化KL散度，需要考虑一个概率分布$q(Z)$的受限制类别"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.1.1分解概率分布 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "平均场理论假设：\n",
    "$$q(Z)=\\prod _{i=1}^M q_i(Z_i)$$\n",
    "\n",
    "对$\\mathcal L(q)$关于所有概率分布$q_i(Z_i)$进行变分最优化，分离出只依赖一个因子$q_j(Z_j)$,记做$(q_j)$,有：\n",
    "$$\\begin{align}\n",
    "\\mathcal L(q)&=\\int \\prod_iq_i\\bigg \\{lnp(X,Z)-\\sum_ilnq_i\\bigg \\}dZ\\\\\n",
    "&=\\int q_j\\bigg \\{\\int lnp(X,Z)\\prod_{i\\neq j}q_idZ_i\\bigg \\}dz_j-\\int q_ilnq_jdZ_j+const\\\\\n",
    "&=\\int q_j ln\\tilde p(X,Z_j)dZ_j-\\int q_jlnq_jdZ_j+const\\\\\n",
    "\\end{align}$$\n",
    "\n",
    "其中，我们定一个新的概率分布，即所有$z_i(i\\neq j)$上的q概率分布的期望\n",
    "$$\\begin{align}\\int q_j ln\\tilde p(X,Z_j)&=\\int lnp(X,Z)\\prod_{i \\neq j}q_idZ_j+const\\\\\n",
    "&=\\mathbb E_{i \\neq j}[lnp(X,Z)]+const\\end{align}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为$\\mathcal L(q)$实际上是$q_j(Z_j)$和$\\tilde p(X,Z_j)$之间的KL散度的负值，在$q_j(Z_j)=\\tilde p(X,Z_j)$处有最小值,最优解为:\n",
    "$$lnq_j^*(Z_j)=\\mathbb E_{i \\neq j}[lnp(X,Z)]+const\\tag{*}$$\n",
    "\n",
    "$$q^*_j(Z_j)=\\frac{exp(\\mathbb E_{i \\neq j}[lnp(X,Z)])}{\\int exp(\\mathbb E_{i \\neq j}[lnp(X,Z)])dZ_j}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "⾸先，恰当地初始化所有的因⼦qi(Zi)然后\n",
    "在各个因⼦上进⾏循环，每⼀轮⽤⼀个修正后的估计来替换当前因⼦。这个修正后的估计由公\n",
    "式的右侧给出，计算时使⽤了当前对于所有其他因⼦的估计。算法保证收敛，因为下界\n",
    "关于每个因⼦$q_i(Z_i)$是⼀个凸函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.1.2分解的近似性质"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设我们最⼩化相反的Kullback-Leibler散度$KL(p\\lVert q)$\n",
    "\n",
    "$$KL(p\\lVert q)=- \\int p(Z)\\bigg [\\sum^M_{i=1}lnq_i(Z_i)\\bigg ]dZ+const$$\n",
    "\n",
    "对每个因子最优化有:\n",
    "$$q^*_j(Z_J)=\\int p(Z)\\prod_{i\\neq j}dZ_i=p(Z_j)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* $KL(q \\lVert p)$:希望$p(Z)$为0的地方$q(Z)$也为0,否则$\\frac{p(Z)}{q(Z)}$会有很大的正数贡献，最⼩化这种形式的KL散度会使得概率分布$q(Z)$避开$p(Z)$很⼩的区域\n",
    "* $KL(p \\lVert q)$:希望$p(Z)$不为0的地方$q(Z)$也不为0,否则$\\frac{q(Z)}{p(Z)}$会有很大的正数贡献\n",
    "* 所以，$KL(q \\lVert p)$得到的近似分布q(x)会比较穻，因为它希望q(x)为0的地斱可能比较多；$KL(p \\lVert q)$而得到的近似分布q(x)会比较宽，因为它希服q(x)不为0的地方比较多\n",
    "\n",
    "⽤⼀个单峰分布近似多峰分布的问题时，基于最⼩化$KL(q \\lVert p)$的变分⽅法倾向于找到这些峰值中的⼀个。相反，如果我们\n",
    "最⼩化$KL(p \\lVert q)$，那么得到的近似会在所有的均值上取平均\n",
    "<center>\n",
    "    <img src=\"../pic/10_3.png\">\n",
    "</center>\n",
    "\n",
    "如上图，在实际应用中真实的后验概率分布是多峰的，图a是最小化$KL(p \\lVert q)$的结果，近似值在所有的均值上平均，b、c最小化$KL(q \\lVert p)$可以找到这些峰值中的一个"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.1.3 例子:一元高斯分布"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们的⽬标\n",
    "是在给定x的观测值的数据集$D=\\{x_1,...,x_N\\}$的情况下，推断均值µ和精度τ的后验概率分布。\n",
    "其中，我们假设数据是独⽴地从⾼斯分布中抽取的。似然函数为"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$p(D|\\mu,\\tau)=(\\frac{\\tau}{2\\pi})^{\\frac{N}{2}}exp\\bigg \\{-\\frac{\\tau}{2}\\sum^N_{n=1}(x_n-\\mu)^2\\bigg \\}$$\n",
    "引入$\\mu$和$\\tau$的共轭先验分布:\n",
    "$$p(\\mu|\\tau)=\\mathcal N(\\mu|\\mu_0,(\\lambda\\tau)^{-1})$$\n",
    "$$p(\\tau)=Gam(\\tau|a_0,b_0)$$\n",
    " 些 分 布 共 同 给 出 了 ⼀ 个 ⾼\n",
    "斯-Gamma共轭先验分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对后验概率分布进行分解变分近似:\n",
    "$$q(\\mu,\\tau)=q_{\\mu}(\\mu)q_{\\tau}(\\tau)$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于$q_{\\mu}(\\mu)$，$Z=(\\mu,\\tau)$,可以得出:\n",
    "$$\\begin{align}lnq^*_{\\mu}(\\mu)&=\\mathbb E_{\\tau}\\bigg [lnp(D|\\mu,\\tau)+lnp(\\mu|\\tau)\\bigg ]+const\\\\&=-\\frac{\\mathbb E[\\tau]}{2}\\bigg \\{\\lambda_0(\\mu-\\mu_0)^2+\\sum_{n=1}^N(x_n-\\mu)^2\\bigg \\}+const\\end{align}$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "所以$q_{\\mu}(\\mu)\\mbox{为}\\mathcal N(\\mu|\\mu_N,\\lambda_N^{-1})$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\mu_N=\\frac{\\lambda_0\\mu_0+N\\overline x}{\\lambda_0+N}$$\n",
    "$$\\lambda_N=(\\lambda_0+N)\\mathbb E[\\tau]$$\n",
    "\n",
    "类似的因子$q_{\\tau}(\\tau)$的最优解为:\n",
    "$$\\begin{align}\n",
    "lnq^*_{\\tau}(\\tau)&=\\mathbb E_{\\mu}[lnp(D|\\mu,\\tau)+lnp(\\mu|\\tau)]+lnp(\\tau)+const\\\\&=(a_0-1)ln\\tau-b_0\\tau+\\frac{N+1}{2}ln\\tau-\\frac{\\mathbb \\tau E_{\\mu}}{2}\\bigg \\{\\lambda_0(\\mu-\\mu_0)^2+\\sum_{n=1}^N(x_n-\\mu)^2\\bigg \\}+const\\end{align}$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因此$q_{\\tau}(\\tau)$是一个gamma分布$Gam(\\tau|a_N,b_N)$:\n",
    "$$a_N=a_0+\\frac{N+1}{2}$$\n",
    "$$b_N=b_0+\\frac{1}{2}\\mathbb E_{\\mu}\\bigg [\\sum_{n=1}^N(x_n-\\mu)^2+\\lambda_0(\\mu-\\mu_0)^2\\bigg ]$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用Gamma分布中的均值的标准结果$\\mathbb E[\\tau]=\\frac{a_N}{b_N}$，可以得到:\n",
    "$$E[\\tau]^{-1}=\\frac{N}{N+1}(\\overline x^2-2\\overline x\\mathbb E[\\mu]+\\mathbb E[\\mu^2])$$\n",
    "再代入$q_{\\mu}(\\mu)$的一阶距和二阶距$\\mathbb E[\\mu]=\\overline x$,$\\mathbb E[\\mu]=\\overline x^2+\\frac{1}{N\\mathbb E[\\mu]}$:\n",
    "$$E[\\tau]^{-1}=\\frac{1}{N}=\\sum^N_{n=1}(x_n-\\overline x)^2$$"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
